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Recently there has been much interest in data that, in statistical 
language, may be described as having a large crossed and severely 
unbalanced random effects structure. Such data sets arise for rec- 
ommender engines and information retrieval problems. Many large 
bipartite weighted graphs have this structure too. We would like to 
assess the stability of algorithms fit to such data. Even for linear 
statistics, a naive form of bootstrap sampling can be seriously mis- 
leading and McCullagh [Bernoulli 6 (2000) 285-301] has shown that 
no bootstrap method is exact. We show that an alternative bootstrap 
separately resampling rows and columns of the data matrix satisfies 
a mean consistency property even in heteroscedastic crossed unbal- 
anced random effects models. This alternative does not require the 
user to fit a crossed random effects model to the data. 

1. Introduction. Many important statistical problems feature two inter- 
locking sets of entities, customarily arranged as rows and columns. Unlike 
the usual cases by variables layout, these data fit better into a cases by 
cases interpretation. Examples include books and customers for a web site, 
movies and raters for a recommender engine, and terms and documents in 
information retrieval. Historically data with this structure has been studied 
with a crossed random effects model. The new data sets are very large and 
haphazardly structured, a far cry from the setting for which normal the- 
ory random effects models were developed. It can be hard to estimate the 
variance of features fit to data of this kind. 

Parametric likelihood and Bayesian methods typically come with their 
own internally valid methods of estimating variances. However, the crossed 
random effects setting can be more complicated than what our models an- 
ticipate. If in IID sampling we suspect that our model is inadequate, then 
we can make a simple and direct check on it via bootstrap resampling of 
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cases. We can even judge sampling uncertainty for computations that were 
not derived from any explicit model. 

We would like to have a version of the bootstrap suitable to large un- 
balanced crossed random effect data sets. Unfortunately for those hopes, 
McCullagh (2000) has proved that no such bootstrap can exist, even for the 
basic problem of finding the variance of the grand mean of the data in a 
balanced setting with no missing values and homoscedastic variables. 

McCullagh (2000) included two reasonably well performing approximate 
methods for balanced data sets. They yielded a variance that was nearly 
correct under reasonable assumptions about the problem. One approach was 
to fit the random effects model and then resample from it. That option is not 
attractive for the kind of data set considered here. Even an oversimplified 
model can be hard to fit to unbalanced data, and the results will lack the 
face value validity that we get from the bootstrap for the IID case. The 
second method resampled rows and columns independently. This approach 
imitates the Cornfield and Tukey (1956) pigeonhole sampling model, and 
is preferable operationally. We call it the pigeonhole bootstrap, and show 
that it continues to be a reasonable estimator of variance even for seriously 
unbalanced data sets and inhomogenous (nonexchangeable) random effects 
models. 

In notation to be explained further below, we find that the true variance 
of our statistic takes the form (va&a + u b®b + u e)/N, where va and vb 
can be calculated from the data and satisfy 1 <C v -C N in our motivating 
applications. A naive bootstrap (resampling cases) will produce a variance 
estimate close to {a\ A- o 2 B + o 2 E )/N and thus be seriously misleading. The 
pigeonhole bootstrap will produce a variance estimate close to {{va + 2)cr^ + 
{ub + 2)cj^ + 3a E )/N. It is thus mildly conservative, but not unduly so in 
cases where each v 3> 2 and a\ does not dominate. 

McCullagh (2000) leaves open the possibility that a linear combination 
of several bootstrap methods will be suitable. In the present setting the 
pigeonhole bootstrap overestimates the variance by twice the amount of 
the naive bootstrap. One could therefore bootstrap both ways and subtract 
twice the naive variance from the pigeonhole variance. That approach, of 
course, brings the usual difficulties of possibly negative variance estimates. 
Also, sometimes we do not want the variance per se, just a histogram that 
we think has approximately the right width, and the variance is only a con- 
venient way to decide if a histogram has roughly the right width. Simply 
accepting a bootstrap histogram that is slightly too wide may be prefer- 
able to trying to make it narrower by an amount based on the naive vari- 
ance. 

Many of the motivating problems come from e-commerce. There one may 
have to decide where on a web page to place an ad or which book to recom- 
mend. Because the data sets are so large, coarse granularity statistics can be 
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estimated with essentially negligible sampling uncertainty. For example, the 
Netflix data set has over 100 million movie ratings and the average movie 
rating is very well determined. Finer, subtle points, such as whether classical 
music lovers are more likely to purchase a Harry Potter book on a Tuesday 
are a different matter. Some of these may be well determined and some will 
not. An e-commerce application can keep track of millions of subtle rules, 
and the small advantages so obtained can add up to something commercially 
valuable. Thus, the dividing line between noise artifacts and real signals is 
worth finding, even in problems with large data sets. 

The outline of this paper is as follows. Section 2 introduces the notation 
for row and column entities and sample sizes, including the critical quantities 
va and vr, as well as the random effects model we consider and the linear 
statistics we investigate. Section 3 introduces two bootstrap models: the 
naive bootstrap and the pigeonhole bootstrap. Section 4 derives the variance 
expressions we need. Section 5 presents a small example, using the bootstrap 
to determine whether movie ratings at Netflix that were made on a Tuesday 
really are lower than those from other days. There is a discussion in Section 6 
including application to models using outer products as commonly fit by the 
SVD. The Appendix contains the proof of Theorem 3. Shorter proofs appear 
inline, but can be easily skipped over on first reading. 

2. Notation. The row entities are i = 1, . . . ,R and the column entities 
are j = 1, . . . , C. The variable Zy G {0, 1} takes the value 1 if we have data 
for the combination and is otherwise. The value Xij G M. d holds the 
observed data when Zu = 1 and otherwise it is missing. To hide inessential 
details we will work with d=l, apart from a remark in Section 4.4. 

The data are Xij for i = 1, . . . , R and j = 1, . . . , C for those ij pairs with 
Zij = 1. The number of times that row i was seen is nj, = Y^f=i Similarly, 
column j was seen n»j = J2iL\ Zij times. The total sample size is iV = n., = 



In addition to the Rx C layout with missing entries described above, we 
can also arrange the data as a sparse matrix via an iV x 3 array S with ^th 
row (Ie,Jg,Xe) for £ = 1,...,N. The value Xg in this layout equals Xi t j t 
from the Rx C layout. The value Ig = i appears nj. times in column 1 of S, 
and similarly, Ji = j appears n,j times in column 2. 

The ratios 



prove to be important later. The value of va is the expectation of n^, when 
i is sampled with probability proportional to n^. If two not necessarily 
distinct observations having the same i are called "row neighbors," then va 
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is the average number of row neighbors for observations in the data set. 
Similarly, ug is the average number of column neighbors. 

In the extreme where no column has been seen twice, every n»j = 1 and 
then i/r = 1- In the other extreme where there is only one column, having 
n.i = N, then ub = N. Typically encountered problems should have 1 <C 
vb *C N and 1 < ^ C iV. For example, an R x C table with no missing 
values has ub = C and we may expect both R and C to be large. Often 
there will be a Zipf-like distribution on n 9 j, and then for large problems 
we will find that 1 <C vb *C N. Similarly, cases with 1 <C va <C N are to be 
expected. 

For the Netflix data set, v for customers is about 646 and v for movies 
is about 56,200. As we will see below, these large values mean that naive 
sampling models seriously underestimate the variance. 

We also need the quantities 



Here is the probability that a randomly chosen data point has a "row 
neighbor" in column j and an analogous interpretation holds for /ij,. If 
column j is full, then fi,j = 1. Ordinarily we expect that most, and perhaps 
even all, of the \i»j will be small, and similarly for //j,. 

2.1. Random effect model. We consider the data to have been generated 
by a model in which the pattern of observations has been fixed, but those 
observed values might have been different. That is, Zij are fixed values for 
i = 1, . . . , R and j = 1, . . . , C. The model has 

(1) Xij = [i + ai + bj +Sij, 

where [i is an unknown fixed value and eij, bj and are random. 

In a classical random effects model (see Searle, Casella and McCulloch 
(1992), Chapter 5) we suppose that ai ~ N(0,a\), bj ~ N(0,a B ) and ~ 
-/V(0,<t|;) all independently. We relax the model in several ways. By tak- 
ing a, ~ (0,cr\) for i = 1, . . . ,R, we mean that a\ has mean and variance 
a\ but is not necessarily normally distributed. Similarly, we suppose that 
bj ~ (0,cr^) and that e%j ~ (0,a E ). We refer to this model below as the ho- 
mogenous random effects model. The homogenous random effects model is 
a type of "superpopulation" model as often used for sampling finite pop- 
ulations. In a superpopulation model we suppose that a large but finite 
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population is itself a sample from an infinite population that we want to 
study. 

Next, there may be some measured or latent attributes making ai more 
variable than ay , for i ^ i' . We allow for this possibility by taking Oj ~ 
(0, cr^), where °~a(i)i ■ ■ ■ ' a A(R) are variances specific to the row entities. 
Similarly, bj ~ (0,cr| (i) ) and ~ (0, a\ {i 

The variables a^, bj and Eij are mutually independent. That condition can 
be relaxed somewhat, as described in Section 6. 

The choice to model conditionally on the observed Zij values is a prag- 
matic one. The actual mechanism generating the observations can be very 
complicated. It includes the possibility that any particular row or column 
sum might sometimes be positive and sometimes be zero. By conditioning 
on Z^, we avoid having to model unobserved entities. Also, in practice, one 
often finds that the smallest entities have been truncated out of the data in 
a preprocessing step. For example, rows might be removed if rij, is below a 
cutoff like 10. A column entity that is popular with the small row entities 
might be seriously affected, perhaps even to the point of falling below its 
own cutoff level. Similarly, the large entities are sometimes removed, but 
for different reasons. In information retrieval, one often removes extremely 
common "stop words" like "and," "of" and "the" from the data. Working 
conditionally lets us avoid modeling this sort of truncation. 

2.2. Linear statistics. We focus on a simple mean 

1 1 N 

i j 1=1 

A bootstrap method that gives the correct variance for a mean can be ex- 
pected to be reliable for more complicated statistics such as differences in 
means, other smooth functions of means and estimating equation parameters 
9 defined via 

1 1 N 

o = ^EE WW) = xz2 f( x *> °y 

i j 1=1 

Conversely, a bootstrap method that does not work reliably for linear statis- 
tics like a mean cannot be trusted for more complicated usages. 

Lemma 1. Under the random effects model described above, 
I / R C R C 

(2) Vre(P'x) = Jp\y2 n 1»°A{i) + /Z n »3 a B(i) + Z V a E(i,j) 
\i=l j=l i=l j=l 
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Under the homogenous random effects model, 

(3) VMM = VA°£ + VB § + (J f. 

Proof. Because the di, b{ and uncorrelated, 
/ 1 R C \ 

Vre(£z) = VkE \-Tr ]C Zi i(V + a i + b j + £ ij) 
V V i=lj=l J 

which reduces to (2). In the homogenous case (2) further reduces to (3). □ 

In the nonhomogenous case, the unequal variance contributions for row 
entities are weighted proportionally to raf.. Thus, when the frequent entities 
are more variable than the others, care must be taken estimating variance 
components for the homogenous model. Using a pooled estimate o\ that 
weights entities equally would lead to an underestimate of the variance of 

fax- 

3. Bootstrap methods. We would like to bootstrap the data in such a 
way that the variance of the bootstrap resampled value ft* approximates 
Vre(Ae)- Even better, we would like to do this without having to model the 
details of the random effects involved in Xu and without having to explicitly 
account for the varying rij. values. 

Here we define a naive bootstrap, which treats the data as IID, and the 
pigeonhole bootstrap. The latter resamples both rows and columns. Neither 
of these bootstraps faithfully imitates the random effects mechanism gener- 
ating the data. Also, neither of them holds fixed the sample sizes raj, and 
the latter does not even hold ./V fixed. Thus both bootstraps must be tested 
to see whether they yield a serious systematic error. 

3.1. Naive bootstrap. The usual bootstrap procedure resamples the data 
IID from the empirical distribution. It can be reliable even when the under- 
lying generative model fails to hold. For example, in IID regression models, 
resampling the cases gives reliable inferences for a regression parameter even 
when the regression errors have unequal variance. 

It would be naive to apply IID resampling of the N observed data points 
to the random effects setting, because the Xij values are not independent. 
Under such a naive bootstrap, (1$, Jf,Xf) is drawn independently and uni- 
formly from the N rows of S, for £= 1, . . . , N. Then 

1 N 
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Lemma 2. The expected value in the random effects model of the naive 
bootstrap variance of ft* is Sre(Vnb(Ax))j which is equal to 

(4) 

+ 772 E (1 - + ^2 E E z '.i4 ; ,,, : • 

Under the homogenous random effects model, 

Proof. The naive bootstrap variance of /t* is s^/N, where s 2 x = (1/iV) : 
X^=i(-3Q — Ax) 2 - Using the [/-statistic formulation, we may write it as 

1 N N 

= ^3 E E E E Z 'J Z ''r ( Q i - a i' + b 3 - h 3' + e V - e i'j') 2 - 
i j V j' 

Then under the random effects model, 
^re(Vnb(Ax)) 

^2 E E E ZijZi>j'E(ai - ai> + 6j - 6j/ + e»j - e^y ) 2 



2iV 3 

1 3 v 0' 

= ^3 E E E E z y %' (^^ + 

+ (1 - ii=t'ii=j')(°i(ij) + 

= ^E f 1 - ^r) + ^2 E (1 - 

i 3 

« J U 

The homogenous case shows the differences clearly. The error term a\ 
gets accounted for correctly, but not the other terms. Where the row variable 
really contributes vao~\/N to the variance, the naive bootstrap only captures 
o~ A {l — ua/N)/N of it. It underestimates this variance by a factor of va/0- ~ 
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ua/N) ~ VAi which may be substantial. The variance due to the column 
variables is similarly under-estimated in the naive bootstrap. 

3.2. Pigeonhole bootstrap. The naive bootstrap fails because it ignores 
similarities between elements of the same row and/or column. A more prin- 
cipled bootstrap would be based on estimating the component parts of the 
random effects model, and resampling from it. However, finding a good way 
to estimate such a model can be burdensome. Normal theory random effects 
models for severely unbalanced data sets are hard to fit. More worryingly, 
the data may depart seriously from normality, the variances might be 
nonconstant, and could even be correlated somehow with the sample sizes 

m,. 

The pigeonhole bootstrap is named for a model used by Cornfield and Tukey 
(1956) to study balanced anovas with fixed, mixed and random effects. We 
place the data into an R x C matrix, resample a set of rows, resample a set 
of columns, and take the intersections as the bootstrapped data set. The 
original pigeonhole model involved sampling without replacement. The pi- 
geonhole bootstrap samples with replacement. The appeal of the pigeonhole 
bootstrap is that it generates a resampled data set with values taken from 
the original data. There is no need to form synthetic combinations of row 
and column entities that were never observed together, nor response values 
Xij that were not observed. 

Formally, in the pigeonhole bootstrap we sample rows r* IID from U{1, . . . , R} 
for i = 1, . . . , R and columns c| IID from U{1, . . . , C} for j = 1, . . . , C. Rows 
and columns are sampled independently. 

The resampled data set has Z*j = Z r * c * and when Z*j = 1, we take X*j = 

X r * c *. The bootstrap sample sizes are n*. =J2j=iZ*j, n* 9 - = Y^f=i^tj an d 

The bootstrap process above is repeated independently, some number B 
of times. 

Row i = 1 in Z*j does not ordinarily correspond to the same entity as 
row % = 1 in Zy . Should we want to keep track of the number of times that 
original row entity r appeared in the resampled data, we would use 

R C R 
i=l j=l i=l 

Similarly, column c appears n* c = J2f=i n lj^c*=c times among the resampled 
values. 

4. Variances. Here we develop the variance formulas for pigeonhole boot- 
strap sampling. It is convenient to consider the entities as belonging to a 
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large but finite set. Then we may work with population totals, a concept 
that makes no sense for an infinite pool of entities. 

There are two uncertainties in a bootstrap variance estimate. One is sam- 
pling uncertainty, the variance under bootstrapping B times of our variance 
estimate. The other is systematic uncertainty, the difference between the 
expectation of our bootstrap variance estimate and the variance we wish to 
estimate. We focus on the latter because the resampling model differs from 
the one we assume has generated the data. The former issue can be helped 
by increasing B, and will be less severe for large N. It will be possible to 
construct examples where sampling fluctuations dominate, but we do not 
consider that case here. McCullagh (2000) also chose to compare expected 
bootstrap variance to sampling variance. 

4.1. Variance of totals in pigeonhole bootstrap. The total value of X over 
the sample is T x = J2iJ2j z ijXij. I n bootstrap sampling, the total of X* is 

rp* y* T^* 

1 x 2—ii 2—ij ij ij' 

LEMMA 3. Under pigeonhole bootstrap sampling, 

E PB (T*) = T X and E PB (N*) = N. 

Proof. First, E{T*) = £f=i £f =1 Eb^X^) = RCE PB (Z* n X* n ) by sym- 
metry. The first result then follows because Epb^Z^X*^ = 1/(RC) J2i J2j %ij x 
Xij . The second result follows on putting N = T Z and N* = T* and consid- 
ering the case X^ = Zij . □ 

Theorem 1. Under pigeonhole bootstrap sampling, 

^ro = (^-±-±>;+(i4)EiS. 

3 » 3 

where T xi , = Y$=i z ij x ij and T x.j = £|Li Z^X^. 
PROOF. We write 

e pb((t*) 2 ) = ^2 zZYl E ~pv( z ij x ij z i>j' x h'") 

i j i> ji 

and then split the sum into four cases depending on whether i = i' or not 
and whether j = j' or not. By symmetry, we only need to consider i, j, i' 
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and j' equal to 1 or 2. Thus, 

E BB ({T*f) = R(R - 1)C(C - l)EMZliXxiZZ2Xh) 
+ RC(C - ^EMZliX^X^) 
+ R(R - yCEpB^X^XZJ 
+ RCE YB {Zl 1 X$ 1 Z* 11 Xt 1 ) 

= R(R - 1)C(C - 1)-^ + RC(C - l)^Yl T xi. 

i 

+ R(R - l ) C -g2(J Y T x»j + RCr RC Y Y Z ij X ij ' 
j i j 

which, combined with Lemma 3, yields the result. □ 

4.2. Ratio estimation. The simple mean of X is jl x = T x /N . Under boot- 
strap resampling, we generate fi* = T*/N*. The mean and variance of /}* 
are complicated because they are ratios. The bootstrap sample size N* , 
appearing in the denominator, is not constant. 

There is a standard way to handle ratio estimators in sampling theory 
[Cochran (1977)]. It amounts to use of the delta method. The approximate 
variance of fi* using ratio estimation takes the form 

(6) Vpb(A*0 = j^EpMt: - " 



Theorem 2. Under pigeonhole bootstrap sampling, 
1 -^)E^.(^.-Ax) 2 +(l-^) 

+ Y Y Zi 3 ( X V ~ Aie) 2 
i j 

where X i9 = £V =1 Z, ; .V, ; and X 9j = Y^x ZijXij/n.j. 

Proof. Let Y iA = X {j - T x Z i:j /N when = 1. Then T y = T x - T X T Z /N = 0. 
Also, T; = T* - T X N*/N and so E PB {(T* - T X N*/N) 2 ) = Vp B (T£). 
From Theorem 1 applied to Y instead of X we have 



1 - c) Yl^yi, + (i - £ ) Y T k ■ YY^y, 
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We find that Xij — ZijT x /N = Xij — fi x whenever = 1, and so 

E T vi. = E (e - % r */^)l = E f E - £*)) 

i i \ j / i \ j / 

i 

and a similar expression holds for EjT^.. Also, 

E E = E E / 'j {X '.i - /^) 2 - 

i j i j L- 1 

Next we study the expected value under the random effects model of the 
bootstrap variance Vpb(A^)- We get an exact but lengthy formula and then 
apply simplifications. 

Theorem 3. The expected value under the random effects model of the 
pigeonhole variance is 



#re(Vpb(££)) = J^2 



E a A(i) X t + E a B(j) X f + E E z '/ f7 /: ,, ; : A ''j 

i j i j 



where 

Af = ( 1 - ± U fl - 2^ + + f 1 - 41 (th. " 2/,,n, + ^ 



C7 "V N NJ \ RJ\ ^ N 

n uj v B \ ( l\( v A n\j 



and for Zij = 1 , 



h3 V C/V JV NJ \ RJ\ N N J N' 
Proof. See the Appendix. 



The variance expression above is unwieldy. We use the notation ~ to 
indicate that terms like 1/R, 1/C, n^/N, n,j/N, i>a/N, vb/N, fi l9 and fi,j 
are considered negligible compared to 1 , as they usually are in the motivating 
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applications. We do not suppose that ni./n?. is negligible because some or 
even most of the rii, could be small. Under these conditions, 

\J ps nlj + 2n,j and for = 1 , 

\fj ps 3. 

Corollary 1. 

£re(Vpb(£;)) 



(7) 



1 

iV2 



X^A(i)K ? . + 2n *») 

. i 

+ Yl a B{j) ( n lj + 2n.j) + 3 ]T ]T X U fT h>..r 



and under the homogenous random effects model, 

E RE (V PB (fil)) ps ^(a> A + 2) + + 2) + 34 w) ). 

Proof. 

E RE (V PB (fu* x )) 



1 

iV2 
1 

W 



■ i j i j 



+ H a m ( n -j + 2n »i) + 3 Zi 3 a E(i,j) 

j i j 

The specialization to the homogenous case follows easily. □ 



The variance contribution from is overestimated by a factor of 3 in 
the pigeonhole bootstrap. In the homogenous case this overestimate is not 
important when a\ <C m.ax(vAO'\, vbo 2 b)- When va and vb are large, it only 
takes a small amount of variation in a\ and a B to make a\ unimportant. 
A similar conclusion follows for the inhomogenous case in terms of appro- 
priately weighted averages of the variances. 

The average value of the variance in equation (7) tracks very closely with 
the desired random effects variance of £i x given by (2), even when the effects 



PIGEONHOLE BOOTSTRAP 13 

are heteroscedastic. Where the latter has nj m , the former has n?. + 2n,., and 
similarly for n\-. Outside of extreme cases J2i n i» a A(i) ^ J2i n i» a A(i) • 

In some applications some or many of the //j, and \x 9 j may be nontrivially 
large. In recommender settings, a small number of books or movies may have 
been rated by a large fraction of people, or some people may have rated an 
astonishingly large number of items. In information retrieval, some terms 
might appear in most documents, as, for example, when we choose to retain 
the stop words. Under such conditions, we get a slight variance reduction: 

and 

Af ps nlj + 2(1 - n.j)n,j, 

while \fj remains approximately 3. But when /ij. is not small, then we may 
reasonably expect nj, to be large and nf 9 ^n^.. Thus, the approximation 
in Corollary 1 is still appropriate. 

4.3. Mean consistency. The expression ps conveys what we ordinarily 
expect to be the important terms. We find EreVpb(£Lx) ~ Vbe(P>x) or 
EreVpb{P> x ) /VREifi-x) ps 1. However, the earlier section left open the pos- 
sibility of extreme cases where J2i n i» a A(i) was no * negligible compared to 
Ej n ?»°A(i)- For example, suppose that cr A ^ = for all i with rii, > 1. Then 
(nf m + 2rii,)a A ^ = 3nf 9 a A ^ and the pigeonhole bootstrap could essentially 
triple the variance contribution from the row entities. 

To formulate "mean consistency" of Vpb(Ax) more carefully, we define 

fo s ( 1 1 va v B 1 1 n im n 9j \ 

(8 e N = max — , — , — ,— , — , — , max — , max— , 

\K U N JM va vb i JM j JM J 

and work in the limit as N — > oo with e^v — > 0. 
Arranging the terms in each A, we get 

= + 0(e N )) + 2m.(l - ^.)(1 + 0(e N )), 

\f = n 2 mj (l + O(ejv)) + 2n mj (l - ^)(1 + 0(e N )) 

and 

\f j = 3 + 0(e N ), 

where the implied constant in all the O(ejv) terms is independent of i and j. 
To rule out pathological heteroscedasticity, we suppose that 

< WlA < ®A(i) < Ma < oo, 
< m B < o~B{j) - M B < oo 
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and 



< m E < a E{ij) < M E <oo 



holds for all 1 < i < R = R(N) and 1 < j < C = C(N). 

Theorem 4. Suppose thata 2 A ^y o~ 2 B ^ ando~ 2 E ^^ obey the bounds above, 
and that e^v — > as N ^ oo. Then 

E{Vvb{^x)) ~ Vke(Ae) = o(e N ) 
Vre(Az) 

Proof. Gathering up the pieces, 

^ r = 0{e N ) + 2(1 + 0{e N )) x p N , 

where 



PAT 



The numerator of p/^ lies between Nitie and N(Ma + Mb + Mb), while the 
denominator is at least JV(^mA + fsmB + m£). Therefore, 

n . . M4 + Mg + M E . , 

0<pN< ; ; =0{e N ). n 

vxmA + v B m B + tue 

4.4. Covariances. The variance expressions in this paper generalize in 
an unsurprising way to covariances of pairs of responses. The simplest way 
to express this is to suppose that Xij £ M. d . Then we may generalize o~ 2 A ^y 

o~ 2 B ^ and v E n j\ to be d x d covariance matrices. The variance formulas go 

through as above. Expressions like Ei n i»°~A(i) ^ Ei n i»°A(i) then mean that 

Ei ni,u'a 2 A ^u <C Ei n i» uf(7 A(i) u f° r an €E with ||u|| = 1. 

5. Netflix movie ratings example. As an example of a small effect near 
the uncertainty level, we consider the day of the week effect in movie rat- 
ings for the Netflix data. This data set is described and is available from 
http : //www . netf lixprize . com/ index. It has 100,480,507 ratings on 17,770 
movies from 480,189 customers. As mentioned above, v for customers is 
about 646 and v for movies is about 56,200. The number of ratings per 
customer ranges from 1 to 17,653. The number of ratings for movies ranges 
from 3 to 232,944. 
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5.1. Day of the week effect. It would be interesting to examine the effects 
of demographic variables on movie ratings, but for privacy purposes those 
are not included in the data. The data set does, however, supply a date for 
each rating. For each day of the week, we may find the average movie rating 
given out. The smallest value is 3.595808 for Tuesdays and the largest is 
3.616449 for Sundays. The day of the week effect is very small. 

Perhaps the movie ratings given out on Sunday do tend to be larger 
than those given out on Sunday. If so, we might investigate whether this 
arises from a different mix of movies being rated that day, a different set 
of customers rating on that day, or some subtle interaction. But before 
doing such followup, we should check whether the difference might just be 
a sampling fluctuation. For such a large data set, sampling fluctuations are 
expected to be small. But the observed effect is also quite small, and the 
sampling fluctuations include random effects from movies and customers 
that can make them much larger than we were used to in the IID setting. 

Figure 1 shows results for 10 pigeonhole bootstrap samples. In each sample 
the means for all 7 days of the week were recorded. There is clearly a bias in 
the bootstrap resampling. The average score on any given day in resampling 
is a ratio estimate of the total of all scores given for that day divided by the 
number of ratings for that day. The bias is very small in absolute terms, but 
not compared to the pigeonhole bootstrap standard deviation. 

A day versus day comparison is more interesting than the absolute level 
for a given day. To compare Tuesday and Sunday, we look at the 10 paired 
average scores. These are shown in Figure 2. The solid point is 0.0206 units 
above the forty-five degree line, indicating that Sunday scores average that 
much higher than Tuesday scores. The resampled points average 0.0214 units 
above the line. This is very close to the sample difference. The biases for 
the two days' scores have almost completely cancelled out, so that some 
resampled points are farther from the line than the original point while 
others are closer. The average resampled difference in means is about 8.16 
times as large as the standard deviation of the resampled differences. The 10 
bootstrap differences are independent, and should be nearly normal because 
of the large sample sizes involved. Then Pr(|i (9) | > 8.16) = 1.88 x 10" 5 . This 
is small enough for us to conclude that the difference is real, even if we 
take account of having selected the most significant of all 21 day to day 
comparisons. 

5.2. Parameters and hypothesis. This section makes clear what hypoth- 
esis is being tested by the bootstrap analysis and what are the underlying 
parameters. Then it looks at how well the random effects model might fit 
the present setting. 

Let Yij be the score for a movie and let Zij be an indicator that it was 
observed. Introduce binary covariates Dj- 10 taking the value 1 if and only if 
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Fig. 1. The horizontal axis depicts the day of the week from Monday at to Sunday at 
6. The vertical axis has average movie rating scores. For each day the solid dot shows that 
day's average movie rating in the original data set. The open dots show the average in 
each of 10 pigeonhole bootstrap samples. 



the ij measurement happened on Tuesday. Similarly, let D^ n be the day of 
week indicator for Sunday. 

The sample average for Tuesday is 



(9) MTue 



ZijDj^Yij 



and /isun is defined similarly. We interpret /tTue above as an estimate of 

(10) /iTue - 



so that /xtuc is the solution to -EreCX^ Zij Dj^ e (Yij — /iTuc)) = 0. 

The null hypothesis Ho being tested by the pigeonhole bootstrap analysis 

is that /i S un - ^Tuc = 0. 

In bootstrapping /ixue plays the role of the parameter and /^ ue plays that 
of the estimate. The parameter /^tuc is well defined so long as Pr(X^- ZijDjj 10 - 
0) > 0. We have neglected the possibility that the denominator in /*Tu e * s 
zero. In practice, one might add a small constant to the denominator of each 
/t^g, or condition on the denominator being positive. Such small resampled 




3.59 3.60 3.61 3.62 



Tuesday 

Fig. 2. The horizontal axis shows the average movie rating given on Tuesday. The ver- 
tical axis shows Sunday. The open circles are from 10 pigeonhole bootstrap samples. The 
solid point is from the original data. For each day the solid dot shows that day's aver- 
age movie rating in the original data set. The open dots show the average in each of 10 
bootstrap samples. 



denominators are, however, a sign that the delta method approximation for 
the variance of the ratio may be inaccurate. 

The simplest path between the random effects model and the data set is 
to suppose that the triple (Yij , D]j ie , Df^ 111 ) approximately follows a crossed 
random effects model of the form studied in this paper. Such a model is 
somewhat unnatural though, because two of the variables are binary. 

All we need, however, is that the first term in the linearization of the 
test statistic follows approximately a crossed random effects model. The 
linearized statistic is not binary. 
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Dropping the subscript and superscript for Tuesday, and writing the ratio 
in (9) as T zdy /T zd , the delta method linearization of this ratio is 

T z d y — E(T zdy ) ^ T z d — E(T z d) 
M E{T zd ) ~ E{T zd y 

= const + J2 ZijDi^YijEiTzd)- 1 - E(T zd y 2 ). 

The linearization for Tuesdays takes six different values, corresponding to 
five different values when Dj} ie = 1 and one value for all cases with Dj^ e = 
0. After linearizing the Sunday versus Tuesday difference, there are eleven 
different values arising as five for Sunday ratings, five for Tuesday ratings, 
and one for ratings made on the other five days. 

5.3. A deeper look into the data. Sunday ratings are about 0.02 points 
higher than Tuesday ratings. This difference is small, but statistically sig- 
nificant, even allowing for random customer and movie effects. 

What follows is an informal data analytic exploration of the nature of this 
discrepancy. Making the analysis formal would take us somewhat beyond the 
results proved here. 

Several explanations for the day of week effect are plausible. One is that 
the harder to please customers rate more often on Tuesday. A second is 
that a given customer making a rating supplies a lower rating if the day is 
Tuesday. There are also versions of these explanations centered on movies. 
The movies rated on Tuesday might tend to be less popular, or a given 
movie getting rated on a Tuesday might tend to get a lower rating than on 
a Sunday. 

A simple proxy for how well a movie is liked is the average score it gets. 
Similarly, the generosity level of a customer may be judged by the average 
score that he or she gives. There were 17,784,852 Tuesday ratings and the 
average over these ratings of our simple movie score is 3.600 (rounded to 
the nearest l/1000th), slightly higher than the simple average of all Tues- 
day scores. The comparable numbers for Sunday are 10,730,350 and 3.616. 
Therefore, it appears that slightly more popular movies are being rated on 
Sundays than on Tuesdays. This simple analysis yields an estimated gap of 
0.016 compared to the observed gap of 0.020. A customer-based version of 
this analysis yields a gap of only 0.010 (from 3.610 versus 3.600). 

By this analysis, the effect seems to be due slightly more to which movies 
are being ranked than to who is doing the ranking. If we add both effects, we 
get a predicted gap of 0.026. This value is higher than what was observed, 
indicating that some amount of double counting is taking place. This double 
counting is consistent with a strong feature in the data. Popular movies 
get more ratings and also higher ratings. Very active customers give more 
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ratings, have a harder time restricting themselves to just the popular movies, 
and, not surprisingly, tend to give out lower ratings. Thus, knowing that 
Tuesday has the less popular movies already leads one to suspect it will 
have the busier and, hence, less generous customers. 

Another analysis looks at customers who gave ratings on both Tuesday 
and Sunday. For each such customer we can measure their average Sunday 
rating and subtract their average Tuesday rating. This gives each customer 
a Sunday versus Tuesday differential. The mean over customers of this dif- 
ferential is 0.016. Perhaps coincidentally, this matches the average movie 
effect. The mean over movies of a comparable movie differential is —0.008. 
It has an unexpected sign, meaning that by this measure Sunday scores are 
lower than Tuesday scores. A more proper analysis uses a weighted average 
of customers or movies, with weights depending on how many data points 
they contribute. The proper weighting may be a matter of debate, but, for 
simplicity, we use the harmonic mean of n-Tuc and ns U n where these are the 
number of Tuesday and Sunday ratings made by the customer. Now the 
weighted mean differential is 0.009. Taking this at face value, customers 
seem to be slightly harder to please on Tuesday. A similar weighted analysis 
by movies gives a differential of 0.011, so movies tend to get lower scores on 
Tuesday. 

The pattern in the differentials is somewhat more subtle than the analysis 
above describes. For a given customer, let Y denote the average of their 
Sunday scores minus the average of their Tuesday scores and let X denote 
the simple average of those two average scores. A plot of Y versus X has 
a great many data points, but a spline smooth, using the harmonic mean 
weights described above shows a pattern. Generous customers are even more 
generous on Sundays than Tuesdays. But hard to please customers give even 
lower ratings on Sundays than Tuesdays. 

In other words, the customers are slightly more extreme on Sundays than 
they are on Tuesdays. Because high scores are more common, this raises the 
average score on Sunday versus Tuesday. A comparable analysis by movie 
shows that unpopular movies get even lower scores on Tuesdays, popular 
movies get about the same score on both days, and intermediate movies get 
somewhat higher scores on Tuesdays. These curves are shown in Figure 3. 

The informal data analysis above gives some support to all four explana- 
tions offered. Tuesday appears to get more of the tougher customers and the 
weaker movies. Furthermore, a given customer or movie seems to result in 
a lower score if the rating is made on Tuesday. From the figure we see that 
the within customer or within movie day of the week effect can be positive 
or negative and may be much larger than 0.002 in absolute value. 

6. Discussion. Ultimately we would like to have a trustworthy bootstrap 
analysis for elaborate methods such as the spectral biclustering procedure 
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of Dhillon (2001), among others. Running five or ten repeats would give 
insight as to which features of the analysis remain stable and which ones 
might be idiosyncratic to the data set at hand. There is a large complexity 
gap between the output of such methods and the simple mean considered 
here. It does, however, seem reasonable to rule out methods that cannot 
handle the global mean and focus further research on one that does. 

We would also wish to have a bootstrap that works under more flexi- 
ble settings than the additive random effects model in (1). The rest of this 
discussion presents two simple generalizations of (1) where the pigeonhole 
bootstrap can be applied and then discusses the issue of bias, and the dif- 
ference between fixed and random Z y - models. 

6.1. Relaxing independence. It is not hard to see that the same variance 
results arise if the ctj, bj and are simply uncorrelated and not necessarily 
independent. This helps us in settings where must be in a restricted set. 




1 2 3 4 5 

Movie score 

Fig. 3. These curves show how the Sunday versus Tuesday rating difference varies with 
the popularity of movies. The solid curve shows an analysis by customers, the dashed 
curve shows an analysis by movies. Suppose that a customer makes nxuc > 1 ratings on 
Tuesday with average value Yme and similarly for Sunday. Then the solid curve is a spline 
smooth on 8 degrees of freedom o/Fsun — Vtuc versus (Yg un + Yr U e)/'2 over customers, with 
weights 2/(l/nsun + 1/nsun)- The dashed curve is computed the same way, using counts 
and averages per movie. 
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For example, if the X^ values can only belong to a finite set, such as movie 
ratings {1,2,3,4,5}, then given and bj there are only 5 allowable values 
for £jj. Because these allowable values depend on + bj the error cannot 
be independent of a, and bj. It can, however, weight its allowable values 
in such a way that E{eij \ ai, bj) = and Vieij | ai, bj) = o%u jy Therefore, 
a model with mutually independent from (0,a^^^) conditionally on 
ai,...,a,R and b\ , . . . , be is plausible. In such a model the are uncorrelated 
with ai and bj. 

The relaxation does not go quite as far as we might want. For example, 
if there is an upper bound on Xy, then the largest possible a, and largest 
possible bj must sum to at most that bound, or else Eij cannot have mean 0. 

6.2. Outer product models. We might suppose that instead of the 
additive random effects model, that an outer product representation is 
more appropriate. Such SVD type models have become important in in- 
formation retrieval [Deerwester et al. (1990)] and DNA microarray analysis 
[Alter, Brown and Botstein (2000)]. 

Under such a model we write 

L 

(11) = ii + ai + bj + ^2, ^euavje + 

e=i 

where the new pieces are scalar singular values and the singular vectors 
with components un and Vjg. Ordinarily the singular vectors are fit to the 
data subject to a norm constraint. As a model for how the data might have 
arisen, we don't have to impose that constraint. We can make the pieces 
random with ua ~ (0,t^^), and Vji ~ {®, T v(je)) independently of each 
other and the aj and bj. 

The model (11) is popular in crop science [Crossa and Cornelius (2002)] 
where the rows and columns correspond to genotypes and environments. 
Surprisingly, to modern readers that model (with L= 1) is about as old as 
the earliest ANOVA papers, going back to Fisher and Mackenzie (1923). 

Writing 

L 

T)ij = ^ ^iUigVjl + Eij, 

e=i 

we find that rjij are uncorrelated with each other and with and bj. Un- 
correlated errors r\ij lead to the same variances as independent ones do. 
Therefore, we can subsume randomly generated outer products into the e^- 
term of model (1). As a consequence, we can apply the pigeonhole bootstrap 
without knowing what the value of L is, including the possibility that L = 
might be the best description of the data. 
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A very early outer product model is the one degree of freedom for non- 
additivity of Tukey (1949). In this model 

(12) Xij = (J, + a,i + bj + Xaibj + £y . 

It differs from the additive plus outer product model in having the same 
variables appear in both places. Once again, we can subsume the outer 
product part into the error because Xaibj + £ij is uncorrelated with a,, bj 
and Xai'bj + e^j when i^i', if all the a i; bj and are independent. 

6.3. Bias. It is a commonplace that biases affect overall levels much more 
than they affect comparisons. The EPA used to say about automobile effi- 
ciency that while "your mileage may vary" from what they report, reported 
differences between vehicles should be accurate. (Now they say "your mpg 
will vary") In practice with the bootstrap, we would like to know whether a 
given statistic is performing with bias like the Tuesday score, or with much 
less bias like the Sunday versus Tuesday difference. For a scalar parameter 
we can compare the resampled values to the original one. 

For our hypothetical bookseller wondering whether classical music lovers 
are more likely to purchase a Harry Potter book on a Tuesday, it now be- 
comes clear that "more likely than what?" is an important consideration. A 
contrast with other days, other books or other customer types will be better 
determined than the absolute level. 

It would be interesting to know whether the bias in the pigeonhole boot- 
strap tracks with the sampling bias in any reasonable generative model hav- 
ing random sample sizes. The mixed model (1) with fixed sample sizes does 
not allow for possibility of bias in sample means. 

6.4. Random observation patterns. The analysis here is conditional on 
the values of Z^. The conditional and unconditional variances of fi x can be 
very different. When that happens the pigeonhole bootstrap will estimate 
the conditional variance which may differ greatly from the unconditional 
one, when Zij is correlated with the response variable. 

For the Netflix data, it is clear that there are strong dependencies between 
and Yij. Most people only rate movies they've seen, and those tend 
to be ones that they like or think they will like. A few people might be 
more likely to supply ratings for the movies they do not like, hoping to 
educate the algorithm about their tastes. Probably a few people spam the 
ratings to boost some movies and/or harm others. But on the whole, positive 
correlation is expected. 

If we want to understand the values of for ij pairs that were not 
observed, then an unconditional analysis accounting for varying Zij is ap- 
propriate. If we want to predict Yu ratings that will be made later, then 
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conditioning is appropriate, because those future ratings will have similar, 
if not the same, selection bias. 

Sometimes the unconditional problem is much more interesting. In the 
extreme, the unconditional analysis is essential when all we observe are the 
Zij and we want to study co-ocurrence. But the conditional problem is 
interesting too. For example, the Netflix competition is only about predicting 
ratings that were actually made and then held out, not about predicting 
rating values that might have been made. So it is a conditional estimation 
problem. Similarly, in e-commerce, = 1 may mean that customer i saw 
an ad for product j, and the retailer studying what happens next is doing 
a conditional inference. 

In some cases the conditional and unconditional variances can be expected 
to be close. Let Z represent for i = 1, . . . , R and j = 1, . . . , C. Letting Z 
be random, we write 

V(fi x ) = E(V RE {jl x | Z)) + V(E RE (fi x | Z)). 

When Z^ = describes a "missing at random" phenomenon, then E RE (p, x \ 
Z) = fj, has zero variance. Combining missing at random with the random 
effects model, we get 

/ I R C R c 

(13) V(Ar) =eI ^2J2 n i» a A({) + Ys n % a B(i) + ^z2 Z V a E(i,j) 

\ i=l j=l i=lj=l 

which reduces to 

(14) E\Jj(u A ol + VBol + ol) 

for a homogenous random effects model. When the quantity within the ex- 
pectations in (13) or in (14) is stable under sampling of Z, then the condi- 
tional variance estimated by the pigeonhole bootstrap will be close to the 
unconditional one. 



APPENDIX: PROOF OF THEOREM 3 

Theorem 3. The expected value under the random effects model of the 
pigeonhole variance is 



£re(Vpb(A*)) = ^ 



J2 a A{i) X t + a B(j) X f + J2 J2 Z ij a E{i,j) X i,j 
i j i j 



where 



A - = I 1 - i ) nt \ l - 2 -f + ~-w ) H 1 - r ) - 2 " i " ii - + iv 
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and /or Zij = 1 , 

Af,= [l-iVl-2% + ^ 



1 j v cv v n n j 



Proof. First, 



Y Z ij X ij Z i>j X i'j 
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rij, 

N 



i> j 

Y X] -^'i (/" + a *' + + £ i'i) ( li'=i - "TT 



V 
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Therefore, 



N 

% j 



^ — ^ ^ — ^ 2 2 f 

= 2^1^ a A{i>) n i>. I lf'=i - 2 • V^^r + ^2 
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+ 1^1^ a B(j) \ Zi i ~ 2Zi i N + ^2 

* 3 
i V j 



^,^,.(1-2— + — 



N 



+EE4 (( ., j) ^fi-2^ + ^), 



and the analogous expression holds for Ere (E j n,j (X,j — fix) 2 )- Next, for 
cases with Zij = 1, ERE{{Xij — fix) 2 ) equals 

-Ere ^ a i + b j + e ij ~ jjYlYl Zi ' i' ( a i' + b 3' + e *' i' ) J 



Thus, ^RE(EiEj%(^y - fix) 2 ) equals 



E4« n-fl-^ +^-n to 



jV 2 



EE^ 



i3 a E(i,j) 



+ 



iV-1 



AT2 



Putting the pieces together and applying Theorem 2, £ i re(Vpb(A^)) equals 



1 

iV 2 



E a A(i) X t + E A f + E E Z i3 a E(i,j) X F,3 



where 



Af = 1 



C 
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1 \ f Vnn? 



RJ V — 

2 ^2 



RJ J \ N N 

1 \ / v A r? mj 

~c) \' j ~ 2fi 'i n *j + 



Af, = [l-^)(l-2^ + 



+ n *j( y 1 ~-p~J + ]^( N ~ n »^ and for Zjj =1, 
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